Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#116 Allow token processing "middleware" #144

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

nsantini
Copy link

I added node-spellchecker to check for typos, and also "levenshtein" to find the closest spell correction to the original word.

I also modified the "negation" feature to look backwards until a negation word or a new afinn word is found, to cover cases like "not too bad".

lib/index.js Outdated
if (!labels.hasOwnProperty(obj)) continue;

// Check for negation
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I originally forked the repo, here we used to check if the previous word was a negation and invert the score. Now seems like we deal with negation further down. So we end up negating the score twice. Might be that my change to deal with negative words is redundant now

Copy link
Collaborator

@elyas-bhy elyas-bhy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @nsantini, great work. Just a couple of points:

  1. The spell-checking feature looks good to me. However, we have to consider if we should add an option to enable / disable this feature (are there any use-cases where disabling spell-checking is required? Performance perhaps?). @thisandagain thoughts?

  2. Regarding negation: since PR Add support for additional languages #128 for supporting additional languages, this logic has now moved to /languages/en/scoring-strategy.js, or more generally /languages/[language]/scoring-strategy.js. The scoring strategy determines what score to assign to a given token, and allows processing negation, emphasis, and any other language specifics.
    You are currently performing the negation twice, that is what's making your tests fail.

  3. It would be great to have a few validation tests for the backward lookup feature, there doesn't seem to be any.

@nsantini
Copy link
Author

Ok

  • moved the negation backwards strategy to the language model
  • added unit test for it

@nsantini
Copy link
Author

Finally, made spell checking optional, and unit tested it

@elyas-bhy
Copy link
Collaborator

Thanks for the quick update @nsantini, great work!
A few more things:

  1. You seem to have added the distance/spellchecking feature exclusively to the english language (the only officially supported language for now, but other languages can be dynamically registered, see Adding new languages.
    What would it take to make this feature accessible to all languages? Is there a way to set a locale for the spellchecker? If so, we can move the distance/spellchecking modules up a level, and make them available to all languages.

  2. Could you please add the necessary documentation to the README file? Documenting the spellcheck flag would be quite helpful.

  3. Tests look good to me. If you manage to get around the first point of this post, consider adding another test to make sure that your changes are also supported for a new language.

We're almost there!

@nsantini
Copy link
Author

The current implementation uses https://www.npmjs.com/package/spellchecker , which are native bindings to NSSpellChecker, Hunspell, or the Windows 8 Spell Check API, depending on your platform. Windows 7 and below as well as Linux will rely on Hunspell. So it depends on what language you have in your system. Also it depends on those subsystems to support multiple languages.
So, in a way, the spellchecking implementation supports multiple languages.
I added the method signature to the language definition, so other language could support different implementations of the spellchecker.
So the question would be, do we want all language spellcheckers to be implemented in one way depending on the used library, or allow each language to use what they want?

@elyas-bhy
Copy link
Collaborator

Does this mean that, for example, if my OS is set to use the spanish locale, then the spellchecker module would always try to spellcheck with spanish as the target language, regardless of the actual language of the input? If this is the case, I feel like this would lead to a very inconsistent behaviour.

Is there a way to specify to the spellchecker which language to use?
/cc @thisandagain, thoughts?

@nsantini
Copy link
Author

There is a mechanism to add a dictionary to the spellchecker, I believe thats how a language gets "set"

@nsantini
Copy link
Author

I have refactored the spell checking functionality to use a different library nspell that can support multiple languages.
Also updated README with examples of how to use spell checking

@elyas-bhy
Copy link
Collaborator

This is all great work, we're almost there!

  1. If we officially add support to one or more new languages, we would have to duplicate the code for the spellchecking feature (since it is currently located in languages/en).
    Do you think you can move the spellchecking / distance logic up a couple of levels and make it part of the core module (lib)? This would make the feature available to all languages.
    At the language-level, I think it is best if we only provide the appropriate dictionary, and delegate the rest of the processing to the core module. It would make adding new languages much easier.

For example:

// languages/fr/index.js
module.exports = {
    labels: require('./labels.json'),
    dictionary: require('dictionary-fr'),  // specify the language dictionary here for spellchecking
    scoringStrategy: require('./scoring-strategy')
};

  1. I have noticed a drastic performance hit when comparing before/after this PR:
    Before:
sentiment (Latest) - Short  x 561,830 ops/sec ±1.76% (92 runs sampled)
sentiment (Latest) - Long   x 2,689 ops/sec ±1.41% (87 runs sampled)
Sentimental (1.0.1) - Short x 314,373 ops/sec ±1.50% (89 runs sampled)
Sentimental (1.0.1) - Long  x 1,171 ops/sec ±2.00% (88 runs sampled)

After:

sentiment (Latest) - Short  x 239,996 ops/sec ±1.64% (91 runs sampled)
sentiment (Latest) - Long   x 0.21 ops/sec ±3.55% (5 runs sampled)
Sentimental (1.0.1) - Short x 236,762 ops/sec ±5.24% (82 runs sampled)
Sentimental (1.0.1) - Long  x 895 ops/sec ±2.54% (80 runs sampled)

Ideally there should be barely any impact when the spellchecking feature is disabled (which it should be, by default). Could you please investigate this issue?


  1. I have also noticed a slight decrease in code coverage (npm run test:coverage). Could you add a few more tests to cover those cases?

  1. And finally, do you mind documenting this feature in the API Reference section of the README file as well?

Thank you for your patience @nsantini and for bearing with me!

@thisandagain thisandagain assigned nsantini and unassigned elyas-bhy Jun 19, 2018
@thisandagain
Copy link
Owner

thisandagain commented Jun 19, 2018

In addition to the performance implications of this change necessitating the feature be "off" by default, I think the validation tests also strongly suggest that this should be optional:

Before

Amazon accuracy: 0.7202797202797203
IMDB accuracy: 0.7642357642357642
Yelp accuracy: 0.6943056943056943

After

Amazon accuracy: 0.7302697302697303
IMDB accuracy: 0.7532467532467533
Yelp accuracy: 0.6943056943056943

Nice improvement (~1%) on the Amazon validation set, but the IMDB accuracy dropped ~1.1% and Yelp stayed stable. This may suggest that the change is indeed helping with less formal speech, but is actually causing false positives in a more formal corpus (or vs versa), but more investigation would be required.

Copy link
Owner

@thisandagain thisandagain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @nsantini! This is great to see. Agreed with @elyas-bhy's suggestions above. I also would love to further discuss the edit distance strategy and validation test differences as described in the comments.


/**
* Finds the closest match between a statement and a body of words using
* Levenshtein Distance
Copy link
Owner

@thisandagain thisandagain Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Levenshtein Distance is a great performant strategy, but if we are going to make this optional anyway we might want to discuss / test alternative edit distance algorithms (e.g. Myers Diff Algorithm).

PDF:
myers.pdf

@nsantini
Copy link
Author

Ok, moved the spell checking to the library and left loading the dictionary to the language. Added extra unit test, only two lines on the spell checking are not covered, but they are there for safekeeping of an edge case that shouldnt happen.
Looking into performance and accuracy:

sentiment (Latest) - Short x 493,203 ops/sec ±1.14% (90 runs sampled)
sentiment (Latest) - Long x 1,452 ops/sec ±3.65% (82 runs sampled)
Sentimental (1.0.1) - Short x 306,074 ops/sec ±2.98% (86 runs sampled)
Sentimental (1.0.1) - Long x 1,283 ops/sec ±2.45% (87 runs sampled)

Amazon accuracy: 0.7302697302697303
IMDB accuracy: 0.7532467532467533
Yelp accuracy: 0.6943056943056943

The changes to use spell checking, off by default, should not have affected accuracy, since I haven't changed how is tested.

if (dictiaonary === null) {
var base = require.resolve('dictionary-en-us');
dictiaonary = {
'aff': read(base.replace('.js', '.aff'), 'utf-8'),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why do we need to perform such gymnastics when loading the dictionary, instead of simply requiring it and passing it over to nspell, as described in their usage example?

Also, there is a typo in the spelling of the "dictionary" variable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the typo.

Regarding "gymnastics": the dictionary can be loaded on an async fashion. But the whole sentiment library works synchronously. So I decided to not pollute the whole library with callbacks or promises, and given that I couldn't use async/await to do it (eslint complained about it) I decided to load it this way.

README.md Outdated
getDictionary: {
apply: function() {
// Load a dictionary for the language for nspell to use
return { aff, dic };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these aff and dic properties correspond to? Could you give an example?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explained more in the README file

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be the case, are you sure that you have pushed your changes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new commit, it references the explanation I gave in the "Spell checked example" section

@nsantini
Copy link
Author

Hi, latest changes:

  • README references the documentation of nspell and how to add dictionaries

@elyas-bhy
Copy link
Collaborator

LGTM. @thisandagain?

@nsantini
Copy link
Author

@thisandagain any more comments on this?

Regarding your comment about looking for alternative string distance algorithms, I would suggest leave this PR as it is (since the current implementation is performant and correct), and open a new issue to look for alternatives

@nsantini
Copy link
Author

@thisandagain any calls regarding this PR?

@nsantini
Copy link
Author

@elyas-bhy @thisandagain any updates?

@elyas-bhy
Copy link
Collaborator

Everything looks good on my side. Still waiting for @thisandagain to approve the changes.

@nsantini
Copy link
Author

nsantini commented Sep 4, 2018

hi @thisandagain , just checking if you got a change to review the changes

@elyas-bhy
Copy link
Collaborator

Ping @thisandagain.

@nsantini
Copy link
Author

Should I close this PR?
@thisandagain @elyas-bhy

@elyas-bhy
Copy link
Collaborator

I'm really looking forward to get this merged, but need @thisandagain to approve and merge this PR.

@elyas-bhy
Copy link
Collaborator

Ping @thisandagain

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants